An interprovincial input–output database distinguishing firm ownership in China from 1997 to 2017

Input-Output (IO) data describing supply-demand relationships between buyers and sellers of goods and services within an economy have been used not only in economics but also in scientific, environmental, and interdisciplinary research. However, most conventional IO data are highly aggregated, which poses challenges for researchers and practitioners who face complex issues in large countries such as China, where firms within the same IO sector may differ significantly in technology across subnational regions and ownership types. This paper is the first attempt to compile China's interprovincial IO (IPIO) tables with separate information for mainland China-owned, Hong Kong-, Macau-, and Taiwan-owned, and foreign-owned firms inside each province/industry pair. To do this, we collect relevant Chinese economic census data, firm surveys, product-level customs trade statistics, and firm value-added tax invoices and consistently integrate them into a 42-sector, 31-province IO account covering 5 benchmark years between 1997 and 2017. This work provides a solid foundation for a diverse range of innovative IO-based research in which firm heterogeneity in location and ownership matters.


Background & Summary
Production fragmentation and specialization within China is an important driver of the country's growth. During this process of production network formation, the Chinese economy also witnessed a significant transformation of trade flows across industries and regions 1 . The multiregional Input-Output (MRIO) model is widely used to assess the impact on growth of a region- or sector-specific shock and to analyze structural change in the Chinese economy 2,3 . Recently, province-level and even city-level MRIO tables have been proposed to assist in understanding relevant questions [4][5][6][7][8] . However, because they aggregate the production of firms with heterogeneous technologies, previous subnational MRIO tables were usually compiled under the assumption of homogeneous firms within a sector in each region. In addition, most of China's previously published MRIO tables faced an inconsistency issue caused by the discrepancy between aggregated gross regional production (GRP) and national GDP. In recent years, the Chinese government has encouraged the disclosure of official micro data in an effort to promote the development of big data 9 , which has made it possible to compile MRIO tables with firm heterogeneity. At the same time, the National Bureau of Statistics of China (NBS) revised China's historical GRP data based on the 4th national economic census 10 . By incorporating this update, we can mitigate the inconsistency caused by using different sources of regional IO and statistical data to compile interprovincial IO (IPIO) tables.
It is worth emphasizing why it is attractive to compile an IPIO table with heterogeneity across firms and locations to study China's production network. Unlike small countries, whose production technologies show a low level of variation across different firms and locations, there is strong evidence that an assumption of firm homogeneity may lead to measurement errors and estimation bias [11][12][13] because of the significant heterogeneities in production technologies and energy efficiency, technological and financial endowments, and management know-how across firms in China under different types of ownership (e.g., domestically owned and foreign-owned firms) and by geographical location (e.g., coastal or inland areas).
In Chinese statistics, foreign invested firms are further categorized into two groups: (i) Hong Kong, Macao, and Taiwan (HMT) invested enterprises; and (ii) other foreign invested enterprises (FIE). Foreign direct investment (FDI) through these firms has played a significant role in China's rapid industrialization and export miracles. Over the last four decades of China's opening-up, its FDI inflow experienced steady growth, peaking at 290.9 billion dollars 14 in 2013. Despite the impact of the global COVID-19 pandemic and geopolitical tension between the US and China, China remained the world's second-largest recipient of FDI in 2020 (253.1 billion dollars 14 ). FIE- and HMT-invested enterprises have contributed to China's economic growth miracle through various spillover effects, such as branding, sales networks in global markets, technology and managerial know-how transfers, imitation innovations 15 , and human capital accumulation. More importantly, such spillover effects were not distributed evenly across provinces, which further widened the heterogeneities generated by types of firm ownership across China. Distinguishing HMT enterprises from FIEs captures two important features of China's FDI. First, the regional and sectoral distribution of investment by HMT firms is very different from that of FDI made by firms from developed countries. Second, the investment objectives of HMT firms are often different from those of FIEs. HMT investment is usually concentrated in export-oriented sectors (vertical FDI), while investment made by FIEs is often focused on China's vast domestic markets 16 . Therefore, separately tracing the production and trade activities of FIEs and HMT firms throughout the evolution of cross-province supply chains is of great importance for understanding the technological and environmental spillover effects along China's domestic supply chains and China's future role in global supply chains.
Several studies measuring domestic value added or carbon emissions in China's production and trade have explicitly considered heterogeneity across firm types and trade regimes [17][18][19][20][21] . National IO tables based on firm size and ownership types have also been compiled [22][23][24] . To the best of our knowledge, studies that combine both firm and regional heterogeneities in the Chinese economy are rare. The work of Duan et al. 25 is the only one in the literature to provide an MRIO table that captures firm heterogeneity within a sector in each region. However, their MRIO tables only distinguished processing and ordinary trade activities and covered 8 regions and 17 sectors. Currently, despite the high demand in the global research community, there is no IPIO table for China that incorporates firm ownership information. This study intends to fill this gap by utilizing the increasingly available micro data.
Using the economic census, industrial firm surveys, product-level customs statistics, and firms' value added tax (VAT) invoice data, we compiled a new set of IPIO tables for mainland China with separate information on domestically owned, HMT-owned, and foreign-owned firms within each industry in every province. This set of tables combines the strengths of IO tables and national account statistics with firm-level micro data, covering 42 sectors, 31 provinces, and five benchmark years between 1997 and 2017. This new IPIO database has the following special features:
1. All IPIO tables are benchmarked to the up-to-date national account statistics published by the NBS of China.
2. The database consistently identifies firm and regional heterogeneities by dividing each province/industry pair in the calibrated IPIO tables by firm ownership. The types of firm ownership are defined by the share of a set of major economic indicators at the province/industry level, which are estimated from firm-level micro data.
3. The link between micro data and aggregate statistics (e.g., sector-level IO tables and national account statistics) is based on a set of systematically developed concordances among various national and international industrial and product classifications.
4. Firm VAT invoices at the transaction level are used to estimate the interprovincial trade flows.
5. The data production process is transparent. The final datasets are reproducible by readers based on a set of well-documented data files, concordances, and computer codes.
It is worth briefly highlighting how these features can benefit future IO-based research. Feature 1 reduces the inconsistency between the sum of GRP and GDP. Unlike the provincial data reported by local governments in each single-region IO (SRIO) table, the GRP data estimated by NBS attempt to correct the bias in local statistics, which local officials have stronger incentives to misreport 26,27 . More importantly, NBS revised China's historical GDP and GRP data based on the latest economic census to guarantee consistency across provinces over time 10,28 . By benchmarking IPIO tables to the most up-to-date national account statistics in each province consistently compiled by NBS, we also enable meaningful comparisons over time using the new IPIO tables.
Feature 2 not only overcomes the shortcomings of the homogeneous firm assumption underpinning official IO statistics but also helps us better understand the indirect economic and environmental effects of firm behaviors through interregional or inter-sectoral linkages. For example, recent studies on international trade have shown that only a small fraction of enterprises, especially large firms, directly participate in international
1. China's provincial MRIO tables were first benchmarked to the most recent national account statistics published by NBS of China and then were rebalanced and transformed to IPIO tables by using trade statistics by end use categories and VAT invoice data;
2. We estimated the shares of gross output, value added (VA), exports, and imports by three types of firm ownership at each of the 31 province/42 industry pairs from various micro statistics, then split each industry in the IPIO table by the three types of firm ownership.
In this section, we introduce all the data sources used to construct our database and illustrate the detailed procedures by which the new IPIO tables were constructed.
Data sources used for constructing IPIO tables by the three types of firm ownership.
• Recent national account statistics of China by province
This dataset contains GRP data from three accounting approaches (the production approach, the income approach, and the expenditure approach) covering 31 provinces in mainland China. The production-approach data are classified into 9 sectors: agriculture, forestry, animal husbandry, and fishery; industry; construction; wholesale and retail; transportation, warehousing, and postal services; accommodation and catering; finance; real estate; and others. The income-approach data contain labor compensation, net production tax, depreciation of fixed assets, and operating surplus. The expenditure-approach data are categorized into urban consumption,
• VAT invoice data
Our VAT invoice data comprise about three billion invoices per year and cover more than 4 million firms across 31 provinces for 2007, 2012, and 2017. VAT invoice data were obtained from the Golden Tax III system of the State Taxation Administration 42 . Approximately three billion invoices covering four to five million firms were digitized in 2012. The database includes detailed transaction information obtained from special VAT invoices (see Fig. 1). These invoices contain detailed information regarding commodity and service transactions, including the taxpayer identification number, company name, location of both the buyer and the seller, type of good or service, quantity, unit price, total amount, and VAT rate.
Other years (1997, 2002, 2012, 2017) were modified by NBS staff. In addition to those concordances developed by third parties, we also developed several additional concordances in the process of compiling our IPIO tables (more details about concordances can be found in the usage notes).
• Data for estimating related shares by firm ownership to split the calibrated IPIO tables
We combine several data sources to estimate the shares of key variables by firm ownership. Table 2 summarizes the four data sources that were used to identify key economic variables (gross output, value added, exports, intermediate input) by the three firm ownership types. The fundamental problem in using micro data (e.g., detailed economic census and ASIF) to estimate shares by firm ownership is that none of the data sources provides all of the required information over the 20-year time period at the province level. Thus, we combined the four data sources to cover all the benchmark years. Based on the four data sources, we computed the shares by firm ownership of gross output, value added, intermediate input, and export delivery for the years 1998, 2004, 2008, 2013, and 2015. We pick the estimated firm ownership shares for the year closest to each benchmark year as the approximation of the corresponding shares to split the benchmark IPIO tables for 1997, 2002, 2007, 2012, and 2017, respectively.
In 2004, China's central government conducted the first national economic census covering major Chinese businesses and industries. The aim was to collect a comprehensive range of accurate economic data to aid economic analysis and policymaking. After 2008, when the second national economic census was conducted, the census was scheduled to be conducted every five years in conjunction with China's five-year plan. It covers all active firms, irrespective of size or type of ownership. We obtained access to the detailed census data for 2004 and 2008. The data encompass all firms except those in primary industries in 2008 and all industrial firms in 2004. The number of observations is summarized in Table 3.
The detailed census data for 2012 and 2018 are still not accessible. Therefore, we used ASIF data to estimate the shares of these major economic variables by firm ownership for industrial sectors in the benchmark years of 1997, 2012, and 2017. A summary is given in Table 4.
The ASIF is also conducted by the NBS and includes variables similar to those in the economic census. There are two key differences between the ASIF data and the detailed census data. First, the ASIF data cover a continuous time period, while the economic census is only conducted every five years. Second, the ASIF includes only industrial firms that are state-owned or above scale. Above-scale firms are defined by a sales threshold. Before 2011, the threshold was 5 million yuan; it increased to 20 million yuan after 2011. Even though the ASIF does not include below-scale firms, its detailed information allows us to estimate the shares by type of firm ownership for industrial firms in those years in which detailed census data are not accessible.
Table 2. A summary of data sources for estimating shares by firm ownership. Economic census data (detailed firm-level data and summaries in yearbooks) and ASIF report the data for the reporting year, while the fixed asset investment statistical yearbook reports the data for the previous year.
Combining detailed census data and ASIF still cannot cover all sectors for all the years needed. Detailed census data cover only part of the industries in China (the 2004 data do not cover the agriculture and service industries, while the 2008 data do not cover the agricultural sectors). At the same time, ASIF data only cover industrial sectors. To overcome this missing data issue, we used provincial census yearbooks for 2004, 2008, 2013, and 2018 as supplementary data sources for our estimation. After each national economic census, all provincial bureaus of statistics collect the economic data and publish their provincial census yearbook. The format is similar to the national census yearbook but only covers information within each province. Census yearbooks report the output or sales by firm ownership at the industry level. For benchmark year 1997, when no census was conducted, we used China's Fixed Asset Investment Statistical Yearbook for 1999 as our data source, which includes information on national investment in fixed assets in 1998. It provides information on regional (provincial) fixed asset investments by firm ownership in three major industries (primary, secondary, and tertiary). In addition, it provides information on regional (provincial) fixed asset investments by firm ownership in construction, transport, and real estate. All the provincial census yearbooks and China's Fixed Asset Investment Statistical Yearbooks are available in hard copy and can be purchased from China Statistics Press 45 .
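The nearest-year rule described above, which matches each benchmark year to the closest year with estimated ownership shares, can be sketched in a few lines. This is only an illustration: the function name and the tie-breaking rule (earlier year wins, which never applies to the years listed in the text) are assumptions, not part of the original procedure.

```python
# Years with estimated ownership shares and benchmark years, as listed in the text.
SHARE_YEARS = [1998, 2004, 2008, 2013, 2015]
BENCHMARK_YEARS = [1997, 2002, 2007, 2012, 2017]


def closest_share_year(benchmark_year, share_years=SHARE_YEARS):
    """Return the share-estimate year nearest to a benchmark year.

    Assumption: a tie is resolved in favour of the earlier year.
    """
    return min(share_years, key=lambda y: (abs(y - benchmark_year), y))


# Map every benchmark year to the share year used to split its IPIO table.
mapping = {b: closest_share_year(b) for b in BENCHMARK_YEARS}
print(mapping)
```

Under this rule the 2017 table is split with the 2015 shares, because 2015 is the latest year in the list of estimated shares.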
Benchmark interprovincial IO tables based on key statistics from China's national accounts and their rebalancing.
The process of benchmarking and rebalancing the Chinese IPIO tables is summarized by the flowchart in Fig. 2. We start by calibrating the national account statistics, followed by benchmarking the provincial IOTs to the calibrated national account data, where the Tibetan IOT was estimated prior to benchmarking if necessary. Then, the interprovincial trade matrices are rebalanced to fit the rebalanced sum of provincial trade in the benchmarked provincial IO tables. Finally, the MRIO tables are converted into IPIO tables. By integrating detailed import statistics by end use and interprovincial transactions aggregated from VAT invoices, we compile China's IPIO tables that are consistent with the IRIO account in the IO literature.

• Calibrate China's national account statistics
As mentioned above, the official national account dataset is not internally consistent, as a small gap between the sum of GRP and GDP remains for all five benchmark years. Therefore, we need to calibrate the national account statistics before benchmarking the provincial IO tables. We calibrated the GRP by minimizing the squared error subject to constraints. Equation 1 shows how the production-approach GRP was calibrated; GRP calculated from the income or expenditure approach was calibrated in a similar way.
In Eq. 1, $g_i^r$ represents the value added of sector $i$ in region $r$. There are 9 sectors in the production-based value added from NBS. $g0_i^r$ represents the initial value of $g_i^r$. $GRP^r$ is the GRP of region $r$, which is proportionally pre-calibrated to the GDP (see Eq. 2). $G_i$ is the provincial total of sector $i$'s value added, which is pre-calibrated as in Eq. 3.
and were able to align sector data between 2012 and previous years. When a sector needed to be split into two or more sectors, the exogenous proportion used was the ratio of sectoral outputs for Qinghai Province in that year.
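To make the calibration idea concrete, the sketch below rescales raw GRP proportionally to GDP and then adjusts a region-by-sector value-added matrix so that its row sums match the calibrated GRP and its column sums match the sectoral totals. It assumes an unweighted squared-error objective, for which the constrained minimum has a simple closed form; the exact objective and weights in Eq. 1 may differ, and all function names and numbers are hypothetical.

```python
def scale_to_total(values, total):
    """Proportionally rescale values so that they sum to `total`."""
    s = sum(values)
    return [v * total / s for v in values]


def calibrate_least_squares(g0, row_targets, col_targets):
    """Minimise sum((g - g0)**2) subject to fixed row and column sums.

    g0[r][i]       initial value added of region r, sector i
    row_targets[r] calibrated GRP of region r
    col_targets[i] calibrated national total of sector i
    Assumption: unweighted squared error, so the solution is
    g = g0 + d_r/m + e_i/n - T/(n*m) with d, e the row/column gaps.
    Note that this closed form does not enforce non-negativity.
    """
    n, m = len(g0), len(g0[0])
    if abs(sum(row_targets) - sum(col_targets)) > 1e-9 * max(1.0, abs(sum(row_targets))):
        raise ValueError("row and column targets must share the same grand total")
    d = [row_targets[r] - sum(g0[r]) for r in range(n)]
    e = [col_targets[i] - sum(g0[r][i] for r in range(n)) for i in range(m)]
    t = sum(d)
    return [[g0[r][i] + d[r] / m + e[i] / n - t / (n * m) for i in range(m)]
            for r in range(n)]


# Toy example: 2 regions x 2 sectors, national GDP of 110 (hypothetical numbers).
g0 = [[10.0, 20.0], [30.0, 40.0]]
grp = scale_to_total([sum(row) for row in g0], 110.0)  # pre-calibrated GRP
sector_totals = [44.0, 66.0]                           # pre-calibrated sector totals
g = calibrate_least_squares(g0, grp, sector_totals)
print(g)
```

A weighted version (for example, dividing each squared deviation by the initial value) would keep adjustments proportional to cell size but no longer has this closed form and would need a numerical solver.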

• Benchmark the provincial IO tables to the calibrated national account statistics
The original MRIO tables were then rebalanced to fit the province-level value-added data that were calibrated as outlined in step 1. To do so, we used a consistent method across the years. Here, we take the model for 2017 as an example to explain our approach. The model is specified as follows:
The objective function is designed to minimize the distance between the rebalanced data and the original data using the minimum cross-entropy method. The objective function has five terms. The first term is the column structure of the overall table. There are 46 rows (42 sectors and 4 value added items: labor compensation, net production tax, depreciation of fixed assets, and operating surplus) and 51 columns (42 sectors; 5 final use items: urban consumption, rural consumption, government consumption, total fixed asset investment, and changes in inventories; and 4 trade items: exports, imports, interprovincial outflows, and interprovincial inflows). The second term captures the information on GRP calculated from the income approach in four categories. The third term contains the information on GRP calculated from the production approach in nine industries. Both sets of data are from NBS national account statistics and are believed to be more accurate than the GRP calculated from the expenditure approach (this judgment draws on the experience of one of our coauthors, who oversaw the national account statistics at the NBS for decades). To keep the sector structure (the related GRP is calculated from the production approach) of the calibrated values aligned with that of the official values as much as possible, we include the fourth and fifth terms in the objective function. The detailed meanings of the notations of the objective function and its constraints are shown in Table 5.
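As a hedged illustration of the distance measure that a minimum cross-entropy objective uses, the sketch below computes the cross-entropy (Kullback-Leibler type) distance between an adjusted share vector and its initial value. It shows only the generic building block for one column structure; the actual five-term objective and its weights are not reproduced here, and the function name is hypothetical.

```python
import math


def cross_entropy_distance(h, h0):
    """Return sum(h * ln(h / h0)) over cells with h > 0.

    h  : adjusted shares (for example, one column of a column-structure matrix)
    h0 : initial shares; assumed strictly positive wherever h is positive
    The distance is zero when h equals h0 and positive otherwise,
    provided both vectors sum to the same total.
    """
    return sum(a * math.log(a / b) for a, b in zip(h, h0) if a > 0)


initial = [0.5, 0.3, 0.2]
adjusted = [0.6, 0.25, 0.15]
print(cross_entropy_distance(initial, initial))   # no adjustment, no distance
print(cross_entropy_distance(adjusted, initial))  # small positive distance
```

In a full rebalancing model this distance would be minimized by a nonlinear solver subject to the accounting constraints, which is beyond the scope of this sketch.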
The first constraint maintains the row balance of the IO tables. $H^r$ represents the column structure of the IO table for region $r$, whose elements are $h_{ij}^r$. The term $q_{ctrl}^r$ is the column sum control, which equals the total output (total input) and the column sums of the calibrated expenditures. The column sums of provincial trade inflows and outflows are not controlled because their statistical quality is lower than that of the other expenditure items. The 4th to 7th constraints govern the structure of the production-approach GRP, which is described in Table 5. To avoid re-exports, the sum of exports ($ex^r$) and provincial outflows ($pex^r$) should be less than the total output (the 9th constraint). Meanwhile, for each sector, the sum of provincial outflows ($pex_i$) over all regions must equal the sum of provincial inflows ($pim_i$) over all regions, so their difference should be zero (the 10th constraint). Finally, the GRP should be equal to the calibrated production-side value added from the NBS national account.

• Rebalance interprovincial trade
As mentioned before, most previous research relies on rail freight transport data to estimate interprovincial trade flows 24,25,31,32 . However, China's ever-improving highway network has made road transport less expensive, and road transport now plays a more important role in China's interprovincial exchanges than rail transport, with freight turnover of 6.9 trillion versus 3.3 trillion ton-kilometers in 2021. Thus, the previous method of estimating interprovincial trade flows has become increasingly inaccurate. Therefore, we use unique transaction-level VAT invoice data from China's taxation authority as the major data source in this study to estimate China's interprovincial trade matrix. Using VAT invoice data to estimate interprovincial trade linkages has clear advantages over previous estimation methods based on rail freight data. First, the data pass through a detailed audit of each enterprise's VAT invoices and tax payment status via China's Golden Tax Project, thereby providing accurate digital transaction data. Second, they cover a much wider range of goods and services than railway freight data. Third, they are measured in the value of the goods and services traded, rather than in volume as in the railway freight data, and they provide detailed information on the seller and buyer at the four-digit level of China's standard industry classification (CSIC) for every transaction, thus better satisfying the data needs for compiling IRIO tables.
To identify the domestic trade flows between various provinces and sectors, we identified and aggregated firm-level VAT invoices using the following three steps:
(1) Select transactions valued at more than five million yuan from the raw VAT invoice records.
(2) Extract key information from each VAT invoice. For each VAT invoice, the location at the county and district level, the taxpayer identification number, which included the four-digit CSIC code, and the total value of the transaction were collected. Each invoice provided such key information for both purchasers and sellers. The process of adjustment and the structure of the final basic trade flow matrix we developed are shown in Fig. 3.
(3) Aggregate the interprovincial trade flow matrix. In theory, the original VAT matrix could be aggregated at the firm level, but in practice, this is hampered by the lack of access to other firm-level data because of commercial privacy concerns. Thus, to enable a comparison of the matrix with existing estimated trade flow matrices, we use the 4-digit CSIC code. The initial aggregated matrix divides economic activity into 58 sectors at the provincial level, and thus, we combined these into 42 sectors based on the classifications used in the IO tables (see Table S1). When the origin and destination shown on the VAT invoice are in the same province, the transaction is considered intraprovincial; otherwise, it is interprovincial.
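The three steps above can be sketched as a small aggregation routine. The record layout, the field names, the sector labels, and the toy concordance are hypothetical placeholders chosen for illustration; they do not reflect the actual structure of the Golden Tax data or of the CSIC-to-IO concordance.

```python
from collections import defaultdict

THRESHOLD_YUAN = 5_000_000  # step (1): keep transactions above five million yuan


def aggregate_invoices(invoices, csic_to_io):
    """Aggregate invoice values by (seller province, seller IO sector,
    buyer province, buyer IO sector).

    invoices   : iterable of dicts with hypothetical field names
    csic_to_io : hypothetical concordance from 4-digit CSIC codes to IO sectors
    """
    flows = defaultdict(float)
    for inv in invoices:
        if inv["amount"] <= THRESHOLD_YUAN:
            continue  # step (1): drop small transactions
        # Step (2): extract location and sector for both parties.
        key = (inv["seller_province"], csic_to_io[inv["seller_csic"]],
               inv["buyer_province"], csic_to_io[inv["buyer_csic"]])
        # Step (3): accumulate transaction values into the flow matrix.
        flows[key] += inv["amount"]
    return dict(flows)


def split_intra_inter(flows):
    """Return total intraprovincial and interprovincial flow values."""
    intra = sum(v for (sp, _, bp, _), v in flows.items() if sp == bp)
    inter = sum(v for (sp, _, bp, _), v in flows.items() if sp != bp)
    return intra, inter


csic_to_io = {"0001": "sector_A", "0002": "sector_B"}  # illustrative codes only
invoices = [
    {"seller_province": "Guangdong", "seller_csic": "0001",
     "buyer_province": "Hunan", "buyer_csic": "0002", "amount": 8_000_000.0},
    {"seller_province": "Guangdong", "seller_csic": "0001",
     "buyer_province": "Guangdong", "buyer_csic": "0001", "amount": 6_000_000.0},
    {"seller_province": "Hunan", "seller_csic": "0002",
     "buyer_province": "Guangdong", "buyer_csic": "0001", "amount": 1_000_000.0},
]
flows = aggregate_invoices(invoices, csic_to_io)
intra, inter = split_intra_inter(flows)
print(flows, intra, inter)
```

With billions of invoices, the same logic would normally run as a grouped aggregation in a database or a distributed data frame rather than a Python loop.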
We integrated the VAT invoice data to re-estimate the interprovincial trade flows. Because the VAT data for service sectors and a few goods sectors were sparse, we used the initial interprovincial trade flows in the DRC MRIO tables for these sectors as a supplement.
For the agriculture, mining, manufacturing, and electricity industries, the interprovincial trade matrices estimated from VAT invoice data were used as the initial values to rebalance the interprovincial trade as follows:
where $h_i^{sr}$ refers to the share of the outflow of sector $i$'s product from region $s$ to region $r$ in the total provincial inflow of region $r$, and $\bar{h}_i^{sr}$ is its initial value. $H_i$ is composed of $h_i^{sr}$ multiplied by the sum of provincial inflows ($pim_i$), which should be equal to the sum of provincial outflows ($pex_i$).
Notations | Meaning
$x^r$, $\bar{x}^r$, $cVA^r$ | $x^r$ is the total output of region $r$, and $\bar{x}^r$ is the initial value of $x^r$. $cVA^r$ is the ratio of the production-approach GRP taken from the calibrated national account data over that of the initial MRIOs ($cVA^r = vaObj^r / va^r$).
$h_{ij}^r$, $\bar{h}_{ij}^r$ | $h_{ij}^r$ is the column structure of the provincial tables, and $\bar{h}_{ij}^r$ is the initial value of $h_{ij}^r$. For the "scrap and waste" sector of some provinces, the ratio of value added to total input (VA rate) in the official input-output tables of 1997, 2002, and 2007 equals 1, which goes against economic common sense. To deal with this issue, we take the column structure of the province with the highest VA rate as the initial value of the column structure of the few provinces with a VA rate of 1 in 2007. For 1997 and 2002, since the VA rates of the "scrap and waste" sector in most provinces equal 1, we take the column structure of the corresponding provinces in 2007 as the initial value.
$strprd_{jp}^r$, $strprdObj_{jp}^r$ | $va_{jp}^r$ represents the production-approach GRP of sector $jp$ (with 9 aggregated sectors) in region $r$. $strprdObj_{jp}^r$ is the objective value of $strprd_{jp}^r$, which is calculated from the calibrated national account data.
$strprdMax_{jp}^r$ and $strprdMin_{jp}^r$ | $strprdMax_{jp}^r$ is the adjusted structure of production-approach GRP, whose elements equal $strprd - adj_{max}$, where $adj_{max}$ is the adjustment value that makes $strprd$ less than the larger of $strprdObj_{jp}^r$ and the official (uncalibrated) value. Similarly, $strprdMin_{jp}^r$ is the adjusted structure of production-approach GRP, whose elements equal $strprd + adj_{min}$, where $adj_{min}$ is the adjustment value that makes $strprd$ greater than the smaller of $strprdObj_{jp}^r$ and the official (uncalibrated) value.
$pex_i$ and $pim_i$ | $pex_i$ and $pim_i$ represent the sums over provinces of the provincial outflows and inflows of sector $i$, respectively.
where $im_{i,q}^r$ refers to the adjusted imports of sector $i$, category $q$ ($q \in$ {intermediates, consumption goods, capital goods}), and region $r$. Category $q$ is defined by BEC (more details about the concordance table between HS, BEC, and China's IO can be found in the usage notes section). $imM_{i,q}$ and $imB_{i,q}^r$ are the sectoral imports in the rebalanced MRIO tables and the imports in the BEC end use categories, respectively. Specifically, if $im_{i,q}^r$ is greater than the local demand for the products of sector $i$, category $q$, the excess is proportionally allocated to the other categories.
Second, we calculated the shares of imports, local production, and provincial inflows in the total local use in each province. To calculate the share of imports, we assumed that imports are not used for inventory unless domestic production cannot meet the required changes in inventory. In terms of the allocation of domestic products, a certain share of locally produced products is used for local intermediate use, final consumption, and capital formation, i.e., local use excluding inventory. We took the share of such local use in total output as the lower bound of the share. Then, the iterative proportional fitting (IPF or RAS) method was used to obtain balanced shares of local production and provincial inflows while the share of imports remained fixed. Here, "balanced" means that the sum of imports, local production, and provincial inflows is equal to total local use based on the constraints of sectoral local production and provincial imports from the rebalanced MRIO tables. The balance is obtained by sector and region. The model used to balance the shares for sector $i$, region $r$ is as follows:
where $H$ refers to the matrix of shares of imports, local production, and provincial inflows in total local use in sector $i$, region $k$ (see Eq. 8). The RAS method was used to determine the appropriate $R$ and $S$ required to make the initial value of $H$ ($\bar{H}$) meet the constraints. The first constraint makes the sum of local production ($hLp_{i,q}^r \cdot u$; the change in inventory is deducted, same below) equal to local production in the rebalanced MRIO tables ($Lp_{i,k}$), where $hLp_{i,q}^r$ refers to the share of local production ($i$ is the sector index), while $u$ is the local end use category index for intermediate use, consumption, and capital formation.
Similarly, the second constraint makes the sum of provincial inflows ($hPim_{i,q}^{sr} \cdot u$) by end use category equal to the total sectoral provincial inflows in the rebalanced MRIO tables ($Pim_i^{sr}$). The third constraint makes the sum of provincial inflows, local production, and imports equal to the total local use categorized by intermediate use, consumption, and capital formation.
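For readers unfamiliar with RAS, the sketch below shows the generic bi-proportional scaling loop: rows and columns of a non-negative matrix are rescaled in turn until both sets of margins match their targets. It is a textbook version under simplifying assumptions, not the exact model above; in particular, it does not hold the import shares fixed or impose the lower bound on local use, and the function name and toy numbers are hypothetical.

```python
def ras(matrix, row_targets, col_targets, tol=1e-10, max_iter=1000):
    """Generic RAS (iterative proportional fitting) on a non-negative matrix.

    Alternately scales rows (the R factors) and columns (the S factors)
    until row sums match row_targets and column sums match col_targets.
    The two target vectors must share the same grand total.
    """
    a = [list(row) for row in matrix]
    n_rows, n_cols = len(a), len(a[0])
    for _ in range(max_iter):
        # Row scaling.
        for i in range(n_rows):
            s = sum(a[i])
            if s > 0:
                f = row_targets[i] / s
                a[i] = [v * f for v in a[i]]
        # Column scaling.
        for j in range(n_cols):
            s = sum(a[i][j] for i in range(n_rows))
            if s > 0:
                f = col_targets[j] / s
                for i in range(n_rows):
                    a[i][j] *= f
        # Columns now match; stop once rows are also within tolerance.
        if max(abs(sum(a[i]) - row_targets[i]) for i in range(n_rows)) < tol:
            break
    return a


initial = [[1.0, 2.0], [3.0, 4.0]]
balanced = ras(initial, row_targets=[4.0, 6.0], col_targets=[5.0, 5.0])
print(balanced)
```

Cells that must stay fixed, such as import shares, can be handled by subtracting them from the targets before scaling and adding them back afterwards.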
Then, the rebalanced $H$ is used to convert the MRIO tables into IPIO tables assuming that the imports, local production, and provincial inflows used for intermediate use are distributed to the sectors in the same ratios (hLp
Split IPIO tables by the three types of firm ownership according to their shares of gross output, trade, and value added estimated from firm-level data.
After carefully benchmarking the original DRC MRIO tables to the most up-to-date national account statistics and converting them into IPIO tables, we split our IPIO tables by firm ownership using shares estimated from the micro data sources described before. Variables for constructing the split IPIO tables include gross output ($x$), exports ($ex$), imports ($im$), value added ($va$), intermediate transactions ($z$), and final use ($f$). All variables at the province-sector level, which are drawn from our calibrated IPIO tables, are further split by using the firm ownership shares estimated from micro data.

• Gross output, exports, and value added by type of firm ownership
The starting point for constructing firm ownership shares is to estimate the shares of gross output, value added, and export delivery by firm ownership. The key information used to distinguish a firm's ownership type is the firm's registered type ("qiye dengji zhuce leixing" in Chinese) in census or ASIF data. NBS identifies 25 ownership types, including joint ventures between different types of owners. Following NBS' criteria, we classified these 25 types into three major groups, namely, domestically owned, Hong Kong, Macau, and Taiwan-owned, and foreign-owned. Table 6 shows all 25 detailed ownership types. NBS uses a 3-digit code to classify firms' ownership types. Firms whose registered type ID starts with "1" are classified as domestic firms. Firms whose registered type ID starts with "2" are classified as "HMT". The rest of the firms are treated as foreign firms. Firms' registered IDs are assigned based on information on their registered capital. Registered capital can be classified into six types: state, collective, individual, legal person, HMT, and foreign. Following NBS' classification criteria, a joint venture firm is classified as an HMT or foreign firm if its share of HMT or foreign registered capital is greater than 25%. Otherwise, it is classified as a domestic firm. We proposed a three-step method to estimate the shares of gross output, value added, and export delivery by firm ownership by using detailed census and ASIF data:
(1) We used the registered firm type information to identify domestic enterprises.
(2) In addition to all the wholly HMT investment enterprises and wholly foreign investment enterprises, we classified joint-venture firms as either HMT- or foreign-owned if at least 25% of their registered capital was HMT- or foreign-owned, respectively.
(3) After classifying all firms included in the detailed micro data into one of the three ownership types, we aggregated the outputs, value added, export delivery value, and intermediate inputs at the ownership-province-IO sector level to calculate the shares by firm ownership of the needed economic variables. Because the census data provide information based on China's standard industry classification (CSIC), CSIC to China's IO sector (CIO) concordances for each benchmark year were used to aggregate the data at the firm level to China's IO industries.
The above three steps are the general procedure we used to aggregate the detailed firm-level data to the sectors in the calibrated IPIO tables and to calculate the shares of the three firm ownership types. Ideally, step (1) and step (2) are equivalent, since the registered type should be consistent with the registered capital shares under NBS' threshold. However, the registered type and the registered capital shares of some observations are inconsistent. Therefore, we used only registered capital shares to classify joint-venture firms.
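The classification and aggregation steps above can be illustrated with a minimal Python sketch. It is not the authors' STATA code. The field names (`reg_type`, `joint_venture`, `hmt_share`, `foreign_share`), the example codes and values, and the tie-breaking rule for firms whose HMT and foreign capital shares both pass the threshold are assumptions made for illustration only.

```python
from collections import defaultdict

THRESHOLD = 0.25  # NBS cut-off on the HMT or foreign share of registered capital


def classify_by_code(reg_type):
    """Ownership group from the first digit of the 3-digit registered type."""
    if reg_type.startswith("1"):
        return "D"  # domestically owned
    if reg_type.startswith("2"):
        return "H"  # Hong Kong, Macau, and Taiwan owned
    return "F"      # foreign owned


def classify_by_capital(hmt_share, foreign_share):
    """Ownership group from registered-capital shares (fractions of 1)."""
    if max(hmt_share, foreign_share) < THRESHOLD:
        return "D"
    # Assumption: when both shares pass the threshold, the larger one decides.
    return "H" if hmt_share >= foreign_share else "F"


def classify(firm):
    # Joint ventures and firms with no registered type use capital shares.
    if firm["joint_venture"] or not firm["reg_type"]:
        return classify_by_capital(firm["hmt_share"], firm["foreign_share"])
    return classify_by_code(firm["reg_type"])


def ownership_shares(firms, variable):
    """Share of each ownership group in `variable` per (province, sector)."""
    totals = defaultdict(float)
    by_group = defaultdict(float)
    for firm in firms:
        cell = (firm["province"], firm["sector"])
        totals[cell] += firm[variable]
        by_group[cell + (classify(firm),)] += firm[variable]
    return {key: value / totals[key[:2]]
            for key, value in by_group.items() if totals[key[:2]] > 0}


# Hypothetical firm records, already mapped from CSIC to an IO sector.
firms = [
    {"province": "Guangdong", "sector": "Textiles", "reg_type": "110",
     "joint_venture": False, "hmt_share": 0.0, "foreign_share": 0.0, "output": 60.0},
    {"province": "Guangdong", "sector": "Textiles", "reg_type": "210",
     "joint_venture": True, "hmt_share": 0.4, "foreign_share": 0.1, "output": 30.0},
    {"province": "Guangdong", "sector": "Textiles", "reg_type": None,
     "joint_venture": False, "hmt_share": 0.0, "foreign_share": 0.3, "output": 10.0},
]
shares = ownership_shares(firms, "output")
```

The same function can be called with `"value_added"` or an export-delivery field to obtain the other two sets of shares, provided those fields exist in the records.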
It is noteworthy that the 3-step method has several issues that require special treatment. First, some theoretically inconsistent values were reported for key variables, such as negative output, negative employment, and negative registered capital. We assumed that such inconsistencies were the result of measurement errors because only a small proportion of observations (less than 0.01%) were inconsistent. Thus, we simply omitted them.
(2023) 10:293 | https://doi.org/10.1038/s41597-023-02183-2 www.nature.com/scientificdata
Second, value added was not directly reported. Some of the observations record the value of intermediate inputs, and thus, we used the production approach, which derives VA from gross output and intermediate inputs, for these observations. For firms that reported neither value added nor intermediate inputs, we used the income approach to calculate value added according to Eq. 10:
$$\mathrm{VA} = \text{Depreciation} + \text{Labour compensation} + \text{Net tax of production} + \text{Operating surplus} \qquad (10)$$
Depreciation is recorded in the detailed census, while labor compensation is calculated as the sum of total wages, benefits, and unemployment insurance. The net production tax is calculated according to Eq. 11:
$$\text{Net tax of production} = \text{VAT} + \text{Sales tax and extra charges} + \text{Expenses of taxation} - \text{Production subsidies} \qquad (11)$$
Here, operating surplus is calculated as the sum of operating profits and production subsidies. Finally, some of the observations did not report information on the firm's registration type. In the 2008 census, approximately 30% of firms did not report their registration type. For these firms, we skipped the step of checking their registration type and identified their type only by comparing their HMT- and foreign-owned shares of registered capital with the 25% threshold.
As mentioned, ASIF data only include above-scale firms, so one more step is needed to reduce the bias caused by the exclusion of below-scale firms.
We used the ownership shares calculated from small above-scale firms, whose features are supposed to be closer to those of below-scale firms, to approximate the shares of below-scale firms. Based on the NBS guidelines, an industrial firm is defined as small or tiny if it employs fewer than 300 people or its annual sales are less than 20 million yuan (only one condition needs to be met, so an above-scale firm can still be a small firm). Then, we calculated the weighted average shares by firm ownership, where the weights were the above-scale and below-scale shares computed from the provincial census yearbooks. The shares by firm ownership $S_{cen}$ were calculated as follows:
$$S_{cen} = \omega_{above}\, S_{ind} + \omega_{below}\, S_{ind,s}$$
where $S_{ind}$ is the share of firm ownership type based on all firms in the ASIF and $S_{ind,s}$ is the ownership type share based on small industrial firms in the ASIF. $\omega_{above}$ and $\omega_{below}$ are the above-scale and below-scale shares, respectively, of output computed from the provincial census yearbook. This correction cannot solve all selection issues. However, given that below-scale firms constitute only a relatively small portion of the Chinese economy (on average, below-scale firms contribute approximately 10% of output), the bias of the estimated results after our correction should be acceptable.
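As a hedged illustration of the two calculations described above, the following Python sketch computes value added by the income approach (Eqs. 10 and 11) and the weighted-average ownership shares that correct for below-scale firms. The field names and numbers are hypothetical, and the signs in the net-tax formula (taxes added, production subsidies subtracted) are an assumption based on the usual definition of net taxes on production.

```python
def value_added_income_approach(firm):
    """Income-approach value added for one firm record (Eq. 10)."""
    labour = firm["wages"] + firm["benefits"] + firm["unemployment_insurance"]
    # Assumed signs for Eq. 11: taxes are added, production subsidies subtracted.
    net_tax = (firm["vat"] + firm["sales_tax"] + firm["tax_expenses"]
               - firm["subsidies"])
    # Operating surplus is operating profits plus production subsidies.
    surplus = firm["operating_profit"] + firm["subsidies"]
    return firm["depreciation"] + labour + net_tax + surplus


def is_small(employees, sales_million_yuan):
    """Small or tiny industrial firm: either condition is sufficient."""
    return employees < 300 or sales_million_yuan < 20


def census_shares(s_all, s_small, w_above, w_below):
    """Weighted average of ASIF shares (all firms) and small-firm shares."""
    return {t: w_above * s_all[t] + w_below * s_small[t] for t in s_all}


# Hypothetical firm record (arbitrary currency units).
firm = {"depreciation": 5.0, "wages": 20.0, "benefits": 3.0,
        "unemployment_insurance": 1.0, "vat": 4.0, "sales_tax": 2.0,
        "tax_expenses": 1.0, "subsidies": 2.0, "operating_profit": 10.0}
va = value_added_income_approach(firm)

# Hypothetical ownership shares and scale weights for one province/sector.
s_cen = census_shares(
    s_all={"D": 0.70, "H": 0.10, "F": 0.20},
    s_small={"D": 0.90, "H": 0.05, "F": 0.05},
    w_above=0.9, w_below=0.1,
)
```

Because both sets of input shares sum to one and the two weights sum to one, the corrected shares also sum to one.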
For the sectors missing from the detailed census and ASIF data, we used data from provincial census yearbooks to calculate the ownership type shares of gross output in nonindustrial sectors. The following points are noteworthy. First, most provinces do not report sales or output values in the primary and financial industries by registered type but rather the number of people employed. Considering the large proportion of employment in domestic firms in these two sectors (more than 98% on average), even a large productivity difference between domestic and non-domestic firms does not significantly affect the shares. Therefore, we used these two sectors' shares of employment by ownership type as a proxy. Second, no production information was reported for government organizations in the census data. Thus, we assumed that all firms in this sector were domestic. Third, for the construction sector, most of the provinces only report output for general contracting firms and specialist contracting firms (i.e., Zongchengbao and Zhuanye Chengbao). These types of firms account for more than 80% of the output of construction firms, and thus, we used their ownership type shares of output as a proxy. Fourth, for sectors that only report sales or gross output, all shares were approximated by using the shares of sales or gross output.
Finally, we used China's Fixed Asset Investment Statistical Yearbook of 1999 to fill the nonindustrial sectors in the benchmark IO table for 1997. To capture more heterogeneity at the sector level, we also used the Catalogue of Industries for Guiding Foreign Investment (1997 Revision) published by China's Ministry of Commerce. This catalog shows those industries that are forbidden from accepting foreign investment. For firms in those industries, we assumed that they were domestically owned.
After estimating the shares of gross output (sx), exports (se), and value added (sv) by firm ownership, we constructed the three variables as shown in Table 7.
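A minimal sketch of this step, assuming that each variable for a firm type is the province-sector total from the calibrated IPIO table multiplied by the corresponding ownership share. The function name and the numbers are hypothetical.

```python
def split_by_ownership(total, shares, tol=1e-9):
    """Split a province-sector total across firm types D, H, and F."""
    if abs(sum(shares.values()) - 1.0) > tol:
        raise ValueError("ownership shares must sum to one")
    return {firm_type: total * share for firm_type, share in shares.items()}


# Hypothetical totals for one province-sector cell of the calibrated table.
x_split = split_by_ownership(200.0, {"D": 0.6, "H": 0.3, "F": 0.1})   # gross output, sx
ex_split = split_by_ownership(50.0, {"D": 0.2, "H": 0.3, "F": 0.5})   # exports, se
va_split = split_by_ownership(80.0, {"D": 0.7, "H": 0.2, "F": 0.1})   # value added, sv
```

By construction, the three firm-type values add back to the calibrated total, which is the property the later consistency checks rely on.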

• Imports by types of firm ownership
Our approach to constructing imports by different types of ownership requires detailed trade data in addition to census and firm survey data. The products of industry i imported from abroad by industry j in province r are allocated to the three types of firms using the share of imported intermediate goods (i) used by industry j of firm type T. This share was approximated by the intermediate-use share calculated from China's economic census and firm survey data. The underlying assumption is that a firm with a high input share also has a high import demand share.

• Intermediate transactions by type of firm ownership and final use
Based on the estimated gross output and exports by different types of firms, the domestic supply by the three types of firms can be calculated. The domestic supply of industry i by firm types D, H, and F in province r is given by the gross output of each firm type minus its exports.

• Balancing
Because gross output, value added, and exports by the three firm types are estimated based on firm-level data, we believe that these estimates are reliable and thus keep them unchanged in the balancing procedure. However, the initial estimates of imports, intermediate transactions, and final use by the three firm types are computed based on strong assumptions, which leads to an unbalanced IPIO table at this stage. Next, we update these estimates subject to the constraints of the IO accounts to arrive at a balanced table. To do this, we apply the so-called generalized RAS (GRAS) procedure to the import matrix, intermediate transaction matrix, and final-use matrix with column and row controls (see Table 8).
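To illustrate the idea of balancing against row and column controls, the sketch below implements the standard RAS (biproportional scaling) algorithm for a non-negative matrix. This is a simplification: GRAS extends RAS to matrices with negative entries, and the code is not the GAMS/MATLAB implementation used for the database. The matrix and control totals are hypothetical.

```python
def ras(matrix, row_targets, col_targets, tol=1e-10, max_iter=1000):
    """Scale a non-negative matrix so its row and column sums hit the targets."""
    a = [row[:] for row in matrix]  # work on a copy
    n_cols = len(a[0])
    for _ in range(max_iter):
        # Row step: rescale each row to its row control.
        for i, row in enumerate(a):
            s = sum(row)
            if s > 0:
                a[i] = [v * row_targets[i] / s for v in row]
        # Column step: rescale each column to its column control.
        for j in range(n_cols):
            s = sum(row[j] for row in a)
            if s > 0:
                factor = col_targets[j] / s
                for row in a:
                    row[j] *= factor
        # Columns are exact after the column step, so check the rows.
        error = max(abs(sum(row) - t) for row, t in zip(a, row_targets))
        if error < tol:
            break
    return a


initial = [[1.0, 2.0], [3.0, 4.0]]   # hypothetical unbalanced estimates
row_controls = [4.0, 6.0]            # must have the same grand total
col_controls = [5.0, 5.0]            # as the column controls
balanced = ras(initial, row_controls, col_controls)
```

The controls must share the same grand total for a solution to exist; zero cells stay zero, so the sparsity pattern of the initial estimate is preserved.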
Finally, as shown in Table 9, we arrive at the balanced IPIO tables that are split into three types of firms at each province/sector pair. It includes 31 provinces. Three firm types are distinguished for each province (D, H, F), and each type of firm engages in production and trade activities in 42 sectors.
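The block dimensions that follow from this layout can be worked out with a short sketch. The five final-demand categories and four value-added subcategories are those listed in the Data records section; the province-major ordering used in `flat_index` is an assumption for illustration and should be checked against the published files.

```python
N_PROVINCES = 31
N_FIRM_TYPES = 3          # D, H, F
N_FINAL_DEMAND = 5        # final-demand categories per province
N_VALUE_ADDED = 4         # value-added subcategories


def table_shapes(n_sectors):
    """Shapes of the main blocks for a given number of sectors."""
    n = N_PROVINCES * N_FIRM_TYPES * n_sectors
    return {
        "intermediate": (n, n),
        "final_demand": (n, N_PROVINCES * N_FINAL_DEMAND),
        "exports": (n, 1),
        "imports": (n_sectors, n),
        "value_added": (N_VALUE_ADDED, n),
    }


def flat_index(province, firm_type, sector, n_sectors=42):
    """Zero-based position of a (province, firm type, sector) cell.

    Assumed ordering: province first, then firm type, then sector.
    """
    return (province * N_FIRM_TYPES + firm_type) * n_sectors + sector


shapes_2017 = table_shapes(42)   # 42-sector years
shapes_1997 = table_shapes(40)   # 1997 uses 40 sectors
```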

Data records
Balanced IPIO tables split into three types of ownership. The IPIO tables show the regional economic structure and interregional supply chains for 31 provinces and 42 sectors (40 sectors for 1997), each split into three types of ownership. They cover China's economy for five benchmark years: 1997, 2002, 2007, 2012, and 2017. The layout is shown in Table 9. For each year except 1997, the IPIO table contains an intermediate matrix (3,906 × 3,906) for the 42 sectors in 31 provinces with three firm types. For 1997, all dimensions related to the number of sectors are based on 40 sectors instead of 42. For instance, the intermediate matrix is reduced to 3,720 × 3,720, where 3,720 = 31 × 40 × 3. As in other MRIO tables, the final demand of each province consists of 5 categories: rural household consumption, urban household consumption, government consumption, gross fixed capital formation, and changes in inventories. The final demand matrix contains 3,906 × 155 elements for each year except 1997. In addition, exports contain 3,906 × 1 elements measuring the exports of all 42 sectors in 31 provinces by three firm types, while the import matrix contains 42 × 3,906 elements measuring the imports from other countries, and their structure, used by all 42 sectors in 31 provinces by three firm types. Value added includes compensation of employees, net taxes on production, depreciation of fixed capital, and operating surplus, with 4 × 3,906 elements representing the four categories of value added for 31 provinces and 42 sectors with three firm types. The above data and related code can be found in figshare 46 .
Tables (2017). All of these datasets included 42 sectors and 31 provinces.
Four variables, the domestic intermediate-use matrix, the sourcing structure of intermediate inputs (shares of imports, interprovincial inflows, and local inputs), the value added rate (at the provincial and sector levels), and the structures of production-approach GRP and income-approach GRP from China's national account statistics, were involved in the validation.
We followed Steen-Olsen et al. 25,47 , Zheng et al. 48 , and Canning and Wang 49 in comparing the four IO variables with the major existing MRIO tables. Three methods were used to compare the IO matrices: the mean absolute percentage error (MAPE), the Isard-Romanoff similarity index (DSIM), and the absolute entropy distance (AED). MAPE and DSIM are "distance" measures, with both measuring the relative distance between two matrices. MAPE values range from 0 to 100, while DSIM values range from 0 to 1. The lower the value is, the greater the similarity between the matrices is. AED is an information-based statistical measure that reflects the difference between the entropies of the two matrices. The closer the AED value is to zero, the greater the similarity between the matrices.
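For readers unfamiliar with these measures, the sketch below implements one commonly used definition of each. The exact formulas applied in the validation are those of the cited references; the forms here (cell-wise relative differences for MAPE and DSIM, and the difference in Shannon entropy of the normalized matrices for AED) are assumptions for illustration.

```python
import math


def _flat(matrix):
    return [v for row in matrix for v in row]


def mape(a, b):
    """Mean absolute percentage error of b relative to a (cells with a != 0)."""
    pairs = [(x, y) for x, y in zip(_flat(a), _flat(b)) if x != 0]
    return 100.0 * sum(abs(x - y) / abs(x) for x, y in pairs) / len(pairs)


def dsim(a, b):
    """Isard-Romanoff similarity index: mean of |a - b| / (|a| + |b|)."""
    pairs = [(x, y) for x, y in zip(_flat(a), _flat(b)) if abs(x) + abs(y) > 0]
    return sum(abs(x - y) / (abs(x) + abs(y)) for x, y in pairs) / len(pairs)


def entropy(matrix):
    """Shannon entropy (bits) of a non-negative matrix normalized to sum to one."""
    total = sum(_flat(matrix))
    return -sum((v / total) * math.log2(v / total) for v in _flat(matrix) if v > 0)


def aed(a, b):
    """Absolute entropy distance between two matrices."""
    return abs(entropy(a) - entropy(b))


ones = [[1.0, 1.0], [1.0, 1.0]]
twos = [[2.0, 2.0], [2.0, 2.0]]
scores = {"MAPE": mape(ones, twos), "DSIM": dsim(ones, twos), "AED": aed(ones, twos)}
```

In this toy example the two matrices differ in scale but not in structure, so the two distance measures are positive while AED, which depends only on the normalized distribution, is zero.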
In general, our IPIO tables are similar to the other three MRIO tables in the value-added rate and the structure of the income-approach GRP, but with two improvements: in the sourcing structure and in the sector structure of the production-approach GRP. The comparison shows that the value-added rates in our calibrated IPIO tables are very similar to those in the other MRIO datasets, not only at the aggregate level (see second row of Table 10) but also at the provincial and sectoral levels (see Tables 11, 12 for details). This is because we adjusted total output based on the changes in the NBS's revised value added at the province/industry level, so that the value-added rates, which are more reliable than total output according to China's statistical methods, are well preserved.
In addition, since we also tried to make the structure of the income-approach GRP close to that in the NBS national account data, the comparison also shows that our calibrated IPIO data are similar to the other MRIO datasets at the province level (see Table 13 for details).
One improvement is seen in the sourcing structure of intermediate inputs, which is reflected in the relatively lower level of similarity among the sourcing structure of intermediate input matrices and shares of each source separately (see the middle panel of Table 10). Because we improved the sourcing structure based on detailed trade statistics aggregated by UN BEC end-use categories and used detailed VAT invoice data to estimate interprovincial trade flows, the similarity between our tables and the other tables is expected to be lower. The higher similarity among the other MRIO datasets in the sourcing structure of intermediate input matrices (see second row of Table 14) further reinforces this improvement.
Another improvement is reflected in the sector structures of the production-approach GRP. The sector structure of the production-approach GRP in our IPIO tables is almost identical to that in the most up-to-date national account statistics published by NBS of China, reducing the similarities to other MRIO tables at the sector level for both 2012 (see Table 15 for details) and 2017 (see Table 16 for details).
The intermediate input matrices are less similar than the other variables (see last two rows of Table 10). The intermediate input structures of our IPIO tables and the other MRIO tables are more dissimilar than those among the three MRIO tables compared (see last row of Table 14). This is because the intermediate-use matrices reflect the differences in sourcing structure between our tables and the other tables.
Consistency checks among share of firm ownership estimates based on economic census and firm survey data on gross outputs, value added, and trade. Given the high degree of accuracy in the provincial/sector-level data drawn directly from the official census yearbook, our consistency checks mainly focused on the consistency between the aggregated micro-level results from our estimates and the aggregated province/sector results reported in the official census yearbook. Figure 4 shows the main results of the comparisons of the detailed 2008 census data by different types of firm ownership for each province. We compared the estimated shares of output for the different firm types in each province with those calculated using the official provincial census yearbook. Following the NBS's definition of the three major industries, we aggregated the output of the mining industry, manufacturing industry, production and supply of electricity, steam, gas, and water industries, and construction industry by the three firm ownership types to obtain the shares of output by ownership type for the secondary industry. Similarly, we aggregated all service sectors except public organizations to obtain the shares by firm ownership type for the tertiary industry. Overall, the estimates were a good fit with the results from the official census yearbook across the provinces for both industries.
For the rest of the years, consistency at the aggregated level is maintained for the tertiary industry because the shares are directly estimated using aggregated data from the official census yearbooks, so we only checked the aggregated results for the detailed sectors in the secondary industry. Since no official census was conducted in 1998, we used the shares of fixed investment by type of ownership as a proxy for comparison. Figure 5 summarizes the consistency check results for the rest of the benchmark years. It shows that the official aggregate shares from the yearbooks are consistent with our estimates based on micro-level data. The first panel in Fig. 5 shows notably larger discrepancies between the estimated results from micro data and the shares from the census yearbooks because we used the shares of fixed investment as a proxy. Even so, for most of the provinces, the estimated shares from micro data are still consistent with the shares from the China Fixed Assets Investment Statistical Yearbook.
Consistency check of the split and rebalanced IPIO tables with firm ownership information and benchmark data. The major regional account data at the province/industry level in the split IPIO tables with three types of firm ownership and in the calibrated IPIO tables are identical because we used the data from the calibrated IPIO tables as strict aggregation constraints to compile the tables with three types of firm ownership. The regional account data include the following:
(1) Gross output at the province/industry level
(2) Value added at the province/industry level, both for overall value added and four subcategories of value added (employee compensation, net production tax, depreciation of fixed assets, and operating surplus)
(3) Total intermediate imports at the province/industry level
(4) Imports for final use at the province/industry level
(5) Exports at the province/industry level
(6) Total final use at the province level (rural household consumption expenditure, urban household consumption expenditure, government consumption expenditure, gross fixed capital formation, and changes in inventories).
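A check of this kind can be sketched as follows: the values for the three firm types are summed for each province/industry pair and compared with the calibrated total. The dictionary layout, the tolerance, and the numbers are hypothetical choices for illustration.

```python
def aggregates_match(split, calibrated, tol=1e-6):
    """True if firm-type values sum to the calibrated province/industry totals."""
    sums = {}
    for (province, sector, _firm_type), value in split.items():
        key = (province, sector)
        sums[key] = sums.get(key, 0.0) + value
    return all(abs(sums.get(key, 0.0) - total) <= tol * max(1.0, abs(total))
               for key, total in calibrated.items())


# Hypothetical gross output for one province/industry pair.
calibrated_output = {("Beijing", "Chemicals"): 100.0}
split_output = {
    ("Beijing", "Chemicals", "D"): 70.0,
    ("Beijing", "Chemicals", "H"): 10.0,
    ("Beijing", "Chemicals", "F"): 20.0,
}
consistent = aggregates_match(split_output, calibrated_output)
```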
The shares for domestically owned firms, firms owned by Hong Kong, Macau, and Taiwan, and firms owned by foreign countries in gross output and exports at the province/industry level computed from the split IPIO tables with three types of firm ownership are consistent with the shares by firm ownership estimated from micro data. This is because we use these estimated shares from micro data to split gross output and exports by three types of firm ownership and keep them fixed in the balancing procedure. For the estimation of shares in the four subcategories of value added, we encountered missing-data issues. For the 1997, 2002, 2012, and 2017 IPIO tables, we only have micro-level information for industrial firms by three types of ownership, and these shares are consistent with the shares computed from the split IPIO tables. The information for agricultural firms and service firms is missing. For the 2007 IPIO table, the information for industrial firms and service firms is available, but the information for agricultural firms is missing. To solve this problem, we used the shares of overall value added by the three firm ownership types as a proxy for the shares of value added at the subcategory level.

Usage Notes
The five benchmark IPIO tables with three types of firm ownership demonstrate the changes in the production and trade pattern among different sectors and regions over 20 years and can be used to analyze provincial economies within China as a tool for both national and regional economic analysis. Furthermore, by including additional columns such as energy use, carbon emissions, water consumption, air pollution, and employment, these benchmark IPIO tables can be used to undertake extensive China-related research on many economic and environmental issues.
In addition to the IPIO tables, our published datasets include related concordances, relevant input data files, and computer code to generate the IPIO tables. Although these datasets are assembled to generate our IPIO tables with the 3 types of firm ownership, they can also be widely used in research on a variety of China-related issues.
(1) Concordances. Three sets of detailed concordance tables were developed to serve as bridges to aggregate the trade data from China Customs and micro data from economic censuses/annual industrial firm surveys to China's IO industries. The first set of concordances is among the HS, BEC, and China's IO industries (HS-BEC-IO), which is based on the mapping of 8-digit HS codes to the CHN IO sectors developed by the NBS of China (see the NBSHS8toIOsector files in the concordance folder). This set includes tables for each of the five benchmark years (1997, 2002, 2007, 2012, and 2017). Based on the mapping between BEC categories and the HS subheadings from the UNSD, which was further modified by industrial specialists at the US International Trade Commission (see USITC-BEC-HSrev.xls in the concordance folder) and has been used in the APEC-TiVA project led by both the US and China with the participation of most APEC economies, we were able to aggregate the trade data into three end-use categories: consumption goods, capital goods, and intermediate goods. Imports with China Customs trade codes of 20 (Equipment for processing trade), 25 (Equipment/Materials investment by foreign-invested enterprise), or 35 (Equipment imported into Export Process Zone) were classified as capital goods, and those with codes of 14 (Process & assembling) or 15 (Process with imported materials) were classified as intermediate goods. Concordances of 8-digit HS to China's IO sector (CIO) for each of the 5 benchmark IO tables are shown in Table 17. The second set of concordance tables presents the mapping between the CSIC and IO data and contains five tables, one for each benchmark year.
This mapping was undertaken to aggregate the firm-level data in economic censuses and annual surveys of industrial firms (classified by the China System of Standard Industry Classification, CSIC) to China's IO sector classification for comparison with the International Standard Industrial Classification. For each benchmark year, both the aggregated and detailed IO sectors were mapped to the four-digit CSIC code, as shown in Table 18.
The third set of concordance tables was a chained IO sector concordance among the five benchmark years based on the CSIC to IO sector mapping, containing both detailed and aggregated IO sectors. The groupings of IO sectors, as well as CSIC classifications, have undergone significant changes over the 20-year period. The increasing number of detailed CSIC and IO sectors reflects the refinement efforts of industrial classifications made by the NBS of China, making it very difficult to develop a fully consistent IO sector classification that covers all five benchmark years without losing a significant portion of industrial information in the later benchmark years. Therefore, we developed this backward chained IO sector concordance for database users to aggregate the IO sectors at different benchmark years based on their research needs.
(2) Province-IO sector (at both detailed and 42 sector levels) trade data aggregated from China Customs statistics at the 8-digit HS level for 1996-2017, which are distinguished by 5 types of firm ownership.
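The end-use rule for customs records described under the first set of concordances can be sketched as follows. The trade codes and their end-use classes are taken from the text; the record layout, the HS codes, the sector names, and the fallback to a BEC-based end use for all other trade codes are assumptions for illustration.

```python
from collections import defaultdict

CAPITAL_TRADE_CODES = {"20", "25", "35"}
INTERMEDIATE_TRADE_CODES = {"14", "15"}


def end_use(trade_code, bec_end_use):
    """End-use category of an import record.

    Trade codes 20, 25, 35 map to capital goods and 14, 15 to intermediate
    goods; other codes fall back to the BEC-based end use (assumed here).
    """
    code = str(trade_code)
    if code in CAPITAL_TRADE_CODES:
        return "capital"
    if code in INTERMEDIATE_TRADE_CODES:
        return "intermediate"
    return bec_end_use


def aggregate_imports(records, hs8_to_io):
    """Sum import values by (IO sector, end use) using an HS8-to-IO concordance."""
    totals = defaultdict(float)
    for rec in records:
        sector = hs8_to_io[rec["hs8"]]
        totals[(sector, end_use(rec["trade_code"], rec["bec_end_use"]))] += rec["value"]
    return dict(totals)


# Hypothetical concordance and customs records.
hs8_to_io = {"84710000": "Electronics", "52010000": "Agriculture"}
records = [
    {"hs8": "84710000", "trade_code": 25, "bec_end_use": "consumption", "value": 8.0},
    {"hs8": "84710000", "trade_code": 10, "bec_end_use": "consumption", "value": 2.0},
    {"hs8": "52010000", "trade_code": 15, "bec_end_use": "intermediate", "value": 5.0},
]
imports_by_use = aggregate_imports(records, hs8_to_io)
```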

Code availability
The computer code used to generate the IPIO database with three types of firm ownership for mainland China is written in GAMS and MATLAB. The computer code used to process firm-level micro data and trade statistics is written in STATA. All code, with detailed instructions, has been uploaded to figshare provided by Scientific Data 46 . All code will also be available at https://github.com/abumazan/Interprovincial-IO-database/tree/main after publication.
All of the data files used to generate the IPIO tables, except the firm-level data and detailed trade statistics at the product level, are available for public access at figshare.

Author contributions
Quanrun Chen: (corresponding author for IPIO tables by firm ownership) split and rebalanced the IPIO tables by firm ownership, conducted the technical validation, and drafted the related method and technical validation sections. Yuning Gao: organized the drafting of a preliminary version of the paper, integrated materials from coauthors into a complete first draft based on the journal submission template, and contributed to the background and summary, literature review, and data record sections. Served as the contact person for access to detailed VAT invoice data from the General Taxation Bureau of China. Chen Pan: (corresponding author for the calibrated IPIO tables) benchmarked and rebalanced the IPIO tables based on the most up-to-date statistics from China's national account, conducted technical validation, and drafted the related method, data record, and technical validation sections. Contributor of the DRC MRIO tables. Dingyi Xu: (corresponding author for the use of micro data and firm share by type of ownership estimates) prepared estimates, conducted consistency checks on firm ownership types for major economic indicators at the province/industry level from the economic census, annual industrial firm survey, and customs trade statistics, and drafted the related method, data record, and technical validation sections. Conducted major revisions of the preliminary draft, including incorporating other coauthors' revisions and comments into the final version of the paper and carrying out the online submission. Kun Cai: Secretary of the data construction team. Estimated Tibetan IO tables for 1997, 2002, and 2007, developed part of the related concordances, processed economic census data, and collected data on the differences between industrial firms above scale and all enterprises in six of China's provinces. Dabo Guan: advised the data construction team and provided comments and revision suggestions on the preliminary draft of the paper. Qi He: developed part of the related concordance tables.
Shantong Li: advised the data construction team on DRC MRIO tables for China and provided comments and revision suggestions on the preliminary draft of the paper. Contributor of the DRC MRIO tables. Wanqi Liu: described the details of China's economic census data. Bo Meng: advised the data construction team and provided comments and major revision suggestions on the preliminary draft of the paper. Zhi Wang: Data construction team organizer. Developed related concordance tables, aggregated China Customs data at the HS8 level into China IO sectors and UN BEC end-use categories, advised the data construction team on data reconsolidation issues, and conducted the major revision of the preliminary draft into the final version of the paper. Yang Wang: described micro data available from the NBS-Qinghua data center and helped to collect detailed trade statistics from China Customs. XianChun Xu: advised the data construction team on data issues with the national accounts, regional GDP, and IOT and provided comments and revision suggestions on the preliminary draft of the paper. Peihao Yang: processed economic census data and collected data on the differences between industrial firms above a designated size and all enterprises in nine of China's provinces. Meichen Zhang: processed and described data for the interprovincial transaction matrix aggregated from VAT invoices. Assisted in the major revision of the preliminary draft of the paper. Yuanqi Zhou: cleaned firm-level data from the 2004 economic census, processed economic census data, and collected data on the differences between industrial firms above a designated size and all enterprises in six of China's provinces.